Powering Research and Analytics
with a Data Lake and Hadoop
Session #35, February 12, 2019
Rajan Chandras, Director Data Architecture and Strategy, NYU Langone Health
Marilyn Campbell, Manager, Clinical Data Analytics, NYU Langone Health
Conflict of interest
Rajan Chandras, MS, MSc, PMP, PAHM: has no real or apparent conflicts of interest to report.
Marilyn Campbell, MSc: has no real or apparent conflicts of interest to report.
Agenda
- Learning objectives
- The problem: healthcare analytics and challenges to democratizing data
- The solution approach: Hadoop and the NYU Langone data lake
- User experience and use cases
  - Clinical Quality and Effectiveness
  - Cardiovascular repository
  - Predictive analytics
- Ongoing work
- Discussion
Learning objectives
- Recognize the unique analytic needs of healthcare researchers, clinical analysts, informaticists and data scientists
- Compare and contrast different approaches to democratizing data for researchers, clinical informaticists and data scientists
- Discuss the benefits and challenges of using Hadoop for enterprise analytics
- Employ the Hadoop data management platform to implement a "data lake"
Healthcare analytics: complex, unique
Challenges to democratizing data

Approach | Limitations
User access to multiple data sources | Inefficient governance; M repositories x N users
Network share | Not practical; not secure; no value add
Data virtualization/federation (a.k.a. EII) | Cannot scale; expensive
Conventional databases / conventional data warehousing | Conformance to complex data models; expensive ETL; limited scale; expensive at higher scale; cannot handle different types of data
Appliances | Expensive; vendor dependence
NoSQL databases | Not generic; depends on use case
The Hadoop big data platform
2003: Google File System
2004: MapReduce
2008: Cloudera
2009: MapR
2011: Hortonworks
2018: Cloudera + Hortonworks
Servers + Storage + Databases + Query/Processing
Hadoop for self-service analytics
- Designed from the ground up for big data
- Open source and “co-opetitive”
- Secure, scalable and resilient
- On-premise or cloud
- Not just storage but also compute
- Support for streaming data
- Can store and process varied types of data
- Flexible, no pre-defined data models: files, SQL, NoSQL
- Support for BI/analytic tools: JDBC/ODBC, SQL, SAS, Python, R…
- Native and custom metadata
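The "no pre-defined data models" point above is schema-on-read: records land in the lake as-is, and structure is applied only when someone queries them. A minimal sketch of the idea in Python, with in-memory JSON lines standing in for files on HDFS (field names and values are hypothetical):

```python
import io
import json

# Raw records land in the lake as-is; no schema is enforced on write.
raw = io.StringIO(
    '{"mrn": "A1", "sbp": 128}\n'
    '{"mrn": "A2", "sbp": 141, "note": "post-op"}\n'  # extra field needs no schema change
)

records = [json.loads(line) for line in raw]

# Structure is applied at read time: each consumer projects only what it needs.
sbp_by_patient = {r["mrn"]: r["sbp"] for r in records}
print(sbp_by_patient)  # {'A1': 128, 'A2': 141}
```

Contrast with a conventional warehouse, where the second record's extra field would require a schema change before it could be loaded at all.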
Hadoop challenges
- Complex platform with an ever-growing portfolio of technologies
- Not designed for transactional applications
- Limited SQL capabilities, e.g. referential integrity, stored procedures, updates, indexes
- Tool integration takes patience and expertise
- There’s a learning curve
- Not a hammer for every nail

We are here
The NYULH data lake
- Data ingestion & provisioning
- Lift and shift
- Immutable & user workspaces
- Self-service analytics
- Access technologies
- Data governance
- Integration with master data
- Integration with reference data
- Metadata and data lineage
- Data lake explorer
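The "immutable & user workspaces" split above can be pictured as two zones: a read-only landing area holding source extracts, and per-user sandboxes for derived datasets. One possible layout, sketched with local directories standing in for HDFS paths (all path and dataset names here are hypothetical):

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())  # stands in for the HDFS lake root

# Immutable zone: source extracts land here and are never edited in place.
immutable = root / "lake" / "immutable" / "ehr" / "encounters" / "2019-02-01"
# User workspace: each analyst writes derived datasets only to their own sandbox.
workspace = root / "lake" / "workspaces" / "analyst1" / "hf_cohort"

for zone in (immutable, workspace):
    zone.mkdir(parents=True)

(immutable / "part-0000.csv").write_text("mrn,los\nA1,3\n")

# Analysts read from the immutable zone and write results to their workspace.
data = (immutable / "part-0000.csv").read_text()
(workspace / "cohort.csv").write_text(data)
print((workspace / "cohort.csv").exists())  # True
```

The design choice: source data stays reproducible and auditable because nobody can modify it, while self-service users still get a place of their own to save intermediate results.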
Data hive architecture
- Enterprise Performance Analysis and Reporting (ePAR)
- Cardiovascular Data Repository (CVR)
- External Encounters and Data Sets
- Predictive Analytics Unit (PAU)
The user experience
“Put the data in the hands of those best qualified to analyze it”
Target users:
- Have immediate use cases
- Are skilled in data management and in analytic software such as R, SQL, Python, SAS and others
- Are motivated self-starters
Limitations:
- By design, the data lake is only usable by those with advanced analytic skills and knowledge
- Reliant on a motivated user community
- Lack of documentation
- Long-term vision; impact on reporting efficiency will take time
Clinical Quality and Effectiveness
Goal: self-service automated clinical quality reporting and analysis
Why Hadoop works
- Centralized user access to multiple enterprise datasets and reference data
- Access to clinical data not dependent on IT
- Accessible via multiple analytic tools (SQL, SAS, R, Python, etc.)
- Clinical analysis uses separate resources from enterprise production reporting
Use cases
- Base hospital encounter data for external reporting, including clinical data
- Internal reporting using CMS metric logic and reference data
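In practice, "internal reporting using metric logic and reference data" means SQL over lake tables that joins encounter data to reference tables. A toy sketch, using Python's sqlite3 as a stand-in for Hive/Impala SQL (the table names, facility names, and the metric itself, ED visits per facility, are all hypothetical, not actual CMS logic):

```python
import sqlite3

# sqlite3 stands in for Hive/Impala; the SQL shape is the same.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE encounters (enc_id TEXT, facility_code TEXT, enc_type TEXT);
CREATE TABLE ref_facility (facility_code TEXT, facility_name TEXT);
INSERT INTO encounters VALUES
  ('E1','F1','ED'), ('E2','F1','IP'), ('E3','F2','ED'), ('E4','F1','ED');
INSERT INTO ref_facility VALUES ('F1','Hospital A'), ('F2','Hospital B');
""")

# Join encounter data to reference data, then aggregate for the report.
rows = con.execute("""
    SELECT r.facility_name, COUNT(*) AS ed_visits
    FROM encounters e
    JOIN ref_facility r ON r.facility_code = e.facility_code
    WHERE e.enc_type = 'ED'
    GROUP BY r.facility_name
    ORDER BY r.facility_name
""").fetchall()
print(rows)  # [('Hospital A', 2), ('Hospital B', 1)]
```

The same query runs unchanged against lake-scale data; what changes is only the engine underneath and the fact that it no longer competes with production reporting workloads.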
Cardiovascular Data Repository
Goal: unify disparate cardiology “data islands”, ease information sharing, and enable data to shape cardiovascular practice and research
Why Hadoop works
- Can readily absorb disparate pieces of data and enable fast data combination
- No traditional data warehouse organizational walls
- Easy to share analytic data sets; fosters community
- Repository for archived cardiac registry datasets
Use cases
- Data mine, merge, and access previously isolated large data sets, e.g., EKG + structured morphological heart characteristics to localize the source of arrhythmia
- Machine learning to predict clinical outcomes of cardiovascular disease interventions, e.g., likelihood of success or complication of atrial fibrillation ablations, tailored for patients using their EHR and imaging information (deep clinical phenotype)
- Cardiovascular quality improvement dashboards, e.g., care for heart failure or hypertension
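"Fast data combination" across data islands amounts to aligning previously separate modalities on a shared patient key without first agreeing on a unified schema. A toy illustration in plain Python (the field names and values are entirely hypothetical; in the lake these would be separate tables or files merged at query time):

```python
# Two previously isolated modalities, keyed by patient identifier.
ekg = {"A1": {"qt_ms": 410}, "A2": {"qt_ms": 460}}
echo = {"A1": {"lv_ef_pct": 55}, "A2": {"lv_ef_pct": 35}}

# Combine per patient; neither source had to change for the merge to work.
combined = {mrn: {**ekg[mrn], **echo.get(mrn, {})} for mrn in ekg}
print(combined["A2"])  # {'qt_ms': 460, 'lv_ef_pct': 35}
```

The point is organizational as much as technical: once both sources live in one lake, this kind of cross-modality view needs no new warehouse build-out.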
Predictive Analytics Unit
Goal: translate clinical predictive models to the point of care; build, implement, deploy, evaluate, monitor, and maintain machine-learning-based clinical models
Why Hadoop works
- Training sets for machine learning are data-hungry
- Building these datasets is resource-intensive, both in complexity of table joins and in raw volume
- Past state: these intensive queries would sometimes get killed or fail to finish because of competition with database production activities
- Current state: the data lake and Hadoop allow us to quickly build datasets and implement complex joins without competing against production-level activities
Use cases
- Predict 2-month mortality risk for inpatients
- Predict primary diagnosis of congestive heart failure using natural language processing
- Predict patients at risk of end-stage renal disease (i.e. dialysis) in the next year
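The dataset-building step above boils down to joining several lake tables on a patient key and emitting a feature matrix plus an outcome label. A minimal sketch in plain Python, loosely themed on the end-stage renal disease use case (patient IDs, feature names, and labels are synthetic; in practice this would be Hive/Spark SQL over much larger tables):

```python
# Source "tables" from the lake, keyed by patient identifier (synthetic data).
labs = {"A1": {"creatinine": 1.1}, "A2": {"creatinine": 3.4}, "A3": {"creatinine": 0.9}}
encounters = {"A1": {"admits_12mo": 1}, "A2": {"admits_12mo": 4}, "A3": {"admits_12mo": 0}}
esrd_within_1yr = {"A2"}  # outcome label: progressed to dialysis within a year

# Join the sources per patient into a feature matrix X and label vector y.
X, y = [], []
for mrn in sorted(labs):
    X.append([labs[mrn]["creatinine"], encounters[mrn]["admits_12mo"]])
    y.append(1 if mrn in esrd_within_1yr else 0)

print(X)  # [[1.1, 1], [3.4, 4], [0.9, 0]]
print(y)  # [0, 1, 0]
```

On the lake, the same join logic runs on dedicated compute, which is the "current state" win above: training-set builds no longer contend with, or get killed by, production database workloads.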
Ongoing work
- Optimize platform capabilities
- Create documentation
- Integrate with enterprise data governance tool for:
  - Metadata
  - Data lineage
  - Reference data
- Expand content
- Establish as an operational and analytical enterprise data source
Acknowledgments
Martha J. Radford, MD, Chief Quality Officer, Professor of Medicine (Cardiology)
Jeff Shein, BA, Senior Director, Enterprise Data Warehousing and Analytics
Eugene Grossi, MD, Stephen B. Colvin Professor of Cardiothoracic Surgery
Jason Kreuter, PhD, Director, Data & Analytics, Research Associate Professor, Dept. of Medicine
Lior Jankelson, MD, PhD, Assistant Professor of Medicine
Yindalon Aphinyanaphongs, MD, PhD, Director, Clinical Predictive Analytics Unit, Assistant Professor, Population Health and Medicine
Swetha Nukala, MBBS, MPH, Department of Clinical Quality and Effectiveness
Satyaki Adusumally, MS, Medical Center Information Technology
Shekhar Vemuri, Chief Technology Officer, Clairvoyant LLC
Discussion
Challenges, experiences, questions:
- Analytic architectures
- Big data technologies
- Cloud vs. on-premise
- Master data management
- Ontologies, vocabularies and reference data management
- Business glossaries
- Metadata and data lineage
- Data governance
- Shift in skills and tools
- Democratizing data
- How to win friends and influence people

Thank You
Remember to complete the online session evaluation
Marilyn.Campbell@nyulangone.org | www.linkedin.com/in/marilynmcampbell
Rajan.Chandras@nyulangone.org | www.linkedin.com/in/rchandras